In addition to visualization, we generally want to be able to summarize and compare characteristics like:
Central Tendency: “typical values” of the variable
Dispersion: the amount of spread around the central tendency
Modality: the number of “peaks” or “modes” in a distribution.
Skewness: the amount of asymmetry in a variable.
(some things you probably remember from school)
Sum up all the numbers and divide by the total number of observations
\[ \bar{x} = \frac{1}{n}\sum^n_{i=1}x_i \]
\[ \bar{x} = \text{the mean of x} \]
\[ x_i = \text{the individual values of x} \]
\[ n = \text{the number of observations} \]
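The definition above translates directly into code. A minimal sketch in Python (the language here is our choice; any language works):

```python
def mean(values):
    """Sum all the values and divide by the number of observations."""
    return sum(values) / len(values)

x = [3, 4, 6, 11]
mean(x)  # 24 / 4 = 6.0
```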
A useful feature of the mean: the residuals always sum to zero, i.e. \(\sum^n_{i=1}(x_i - \bar{x}) = 0\)
| \(x\) | \(x-\bar{x}\) |
|---|---|
| 3 | \(3 - 6 = -3\) |
| 4 | \(4 - 6 = -2\) |
| 6 | \(6 - 6 = 0\) |
| 11 | \(11 - 6 = 5\) |
| Total: \(24\); Mean: \(24/4 = 6\) | Total: \(-3 + -2 + 0 + 5 = 0\) |
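We can verify the residuals-sum-to-zero property with a quick sketch (Python, using the same four values as the table):

```python
x = [3, 4, 6, 11]
x_bar = sum(x) / len(x)               # 24 / 4 = 6.0
residuals = [xi - x_bar for xi in x]  # [-3.0, -2.0, 0.0, 5.0]
sum(residuals)                        # 0.0: the deviations cancel out
```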
A problematic feature of the mean is that it's sensitive to extreme outliers (and therefore to skew).
For an odd number of observations, the median is the middle number:
\[ x = 1, 3, 3, 6, 7, 8, 9 \]
\[ \text{Median} = 6 \]
For an even number of observations, the median is the mean of the two middle values:
\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]
\[ \text{Median} = 6.5 \]
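Both cases are handled for us by, for example, Python's built-in `statistics` module (shown here as one option among many):

```python
from statistics import median

median([1, 3, 3, 6, 7, 8, 9])      # odd n: the middle value -> 6
median([1, 3, 3, 6, 7, 8, 9, 11])  # even n: mean of the two middle values -> 6.5
```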
Importantly, the median is a skew-robust measure of central tendency.
The mean and median will be similar if there’s no skew:
\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]
\[ \text{Median of x} = 6.5 \]
\[ \text{Mean of x} = 6 \]
But they diverge when we include extreme outliers:
\[ x = 1, 3, 3, 6, 7, 8, 9, 100000000000 \]
\[ \text{Median of x} = 6.5 \]
\[ \text{Mean of x} \approx 12{,}500{,}000{,}005 \]
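A quick sketch of this divergence in Python (the exact mean here is \(12{,}500{,}000{,}004.625\)):

```python
from statistics import mean, median

x = [1, 3, 3, 6, 7, 8, 9, 100_000_000_000]
median(x)  # 6.5 -- unchanged by the extreme value
mean(x)    # roughly 12.5 billion -- dragged upward by the outlier
```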
The modal value is the value that occurs most often.
\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]
\[ \text{Mode of x = 3} \]
Unlike the mean, the mode is a valid measure of central tendency for nominal variables:
\[ \text{Tom, Earl, Tom, Sarah, Beth} \]
\[ \text{Mode = Tom} \]
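Python's `statistics.mode`, for instance, works for both interval and nominal data:

```python
from statistics import mode

mode([1, 3, 3, 6, 7, 8, 9, 11])                # 3
mode(["Tom", "Earl", "Tom", "Sarah", "Beth"])  # 'Tom' -- valid for nominal variables
```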
Variables may have more than one modal value. For instance, the cross-national distribution of male average years of schooling is roughly bimodal
By contrast, the percentage of a country's population that is working-age is unimodal: most countries fall between 65% and 70%, and no other value is nearly as common.
| | Nominal | Ordinal | Interval |
|---|---|---|---|
| Mean | ❌ | ❌ | ✅ |
| Median | ❌ | ✅ | ✅ |
| Mode | ✅ | ✅ | ✅ |
The standard deviation is a … standard measure of dispersion for interval variables based on squared deviations from the mean.
A larger standard deviation, all else equal, indicates that observations tend to deviate from the mean more.
To calculate the standard deviation for a sample:
1. Calculate \(\bar{x}\) (the mean of \(x\))
2. Calculate the residual (\(x_i - \bar{x}\)) for each value
3. Square each residual and sum.
4. Calculate the variance by dividing this total by the number of observations minus 1.
5. Calculate the standard deviation by taking the square root of the variance.
| x | Deviation from mean (5) | Differences squared |
|---|---|---|
| 2 | -3 | 9 |
| 4 | -1 | 1 |
| 4 | -1 | 1 |
| 4 | -1 | 1 |
| 5 | 0 | 0 |
| 5 | 0 | 0 |
| 5 | 0 | 0 |
| 7 | 2 | 4 |
| 9 | 4 | 16 |
| Mean = 5 | Total = 0 | TSS = 32 |
\[\text{Var(x)}=\frac{32}{(9-1)} = 4\] \[s_x = \sqrt4 = 2\]
Fortunately, we don’t have to do this by hand:
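For example, Python's built-in `statistics` module does the arithmetic for us (using the same nine values as the table):

```python
from statistics import variance, stdev

x = [2, 4, 4, 4, 5, 5, 5, 7, 9]
variance(x)  # sample variance: 32 / (9 - 1) = 4
stdev(x)     # square root of the variance = 2.0
```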
The key thing to remember is just that the standard deviation is sort of like “an average of differences from the average”
Mean and standard deviation will come up a lot.
In formulas, you’ll often see standard deviation represented using the Greek letter \(\sigma\).
The mean is often represented with the Greek letter \(\mu\).
The range is simply the difference between the lowest and highest values.
The interquartile range (IQR) is the difference between the 25th and 75th percentiles (the first and third quartiles) of a variable.
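Both can be computed with Python's standard library; as a sketch (note that `statistics.quantiles` defaults to the "exclusive" interpolation method, so other software may give slightly different quartiles):

```python
from statistics import quantiles

x = [2, 4, 4, 4, 5, 5, 5, 7, 9]
data_range = max(x) - min(x)     # 9 - 2 = 7
q1, q2, q3 = quantiles(x, n=4)   # quartile cut points
iqr = q3 - q1                    # spread of the middle 50% of the data
```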
| | Nominal | Ordinal | Interval |
|---|---|---|---|
| Standard Deviation | ❌ | ❌ | ✅ |
| IQR | ❌ | ✅ | ✅ |
Skew refers to the degree of asymmetry in data.
When the distribution is basically symmetric, the mean and the median essentially overlap.
With right skew, extreme high values pull the mean higher than the median.
There are multiple measures of skewness, but the one you're most likely to encounter is Fisher's moment coefficient of skewness.
Nevertheless, in practice, checking the difference between the mean and the median is a good way to identify skew.
Centering a variable means subtracting that variable’s mean from each individual observation, giving it a mean of 0.
Scaling a variable means dividing each observation by that variable's standard deviation (a measure of variability), giving it a standard deviation of 1.
Centering and standardizing (also called z-scaling) can make it feasible to compare two variables with different scales.
Z standardization gives us a way to compare these variables on similar scales:
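A minimal sketch of z-standardization in Python (the function name is illustrative):

```python
from statistics import mean, stdev

def z_scores(values):
    """Center (subtract the mean) and scale (divide by the standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# The transformed variable has mean 0 and standard deviation 1,
# so variables originally on different scales become comparable.
z = z_scores([2, 4, 4, 4, 5, 5, 5, 7, 9])
```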
Log transformations are often used on variables that take strictly positive values (the log of 0 or of a negative number is undefined).
Remember that \(\log_b(x)\) is the power to which \(b\) must be raised to equal \(x\). So \(\log_{10}(100) = 2\) because \(10^2 = 10 \times 10 = 100\).
And \(\log_{10}(1000) = 3\), \(\log_{10}(10000) = 4\), and so on…
Log transformations are often used to compress skewed distributions:
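As a small illustration (Python, using base-10 logs):

```python
import math

x = [1, 10, 100, 1000, 100000]       # right-skewed: one value dwarfs the rest
logged = [math.log10(v) for v in x]  # roughly [0, 1, 2, 3, 5]: the gaps shrink
```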
They can also make non-linear relationships into linear relationships, which can make them easier to work with graphically and mathematically: